Load the Dataset from a CSV File and Create a DataFrame¶
Print the first few rows of the DataFrame to quickly inspect the data
import pandas as pd
# Load the downloaded CSV file
data = pd.read_csv('knn_authentication_features.csv')
print("Loaded Data:")
data.head()
Loaded Data:
| filename | label | student_id | Mfcc_1 | Mfcc_2 | Mfcc_3 | Mfcc_4 | Mfcc_5 | Mfcc_6 | Mfcc_7 | ... | Spectral_Contrast_2 | Spectral_Contrast_3 | Spectral_Contrast_4 | Spectral_Contrast_5 | Spectral_Contrast_6 | Spectral_Contrast_7 | Zero_Crossing_Rate | RMS_Energy | Spectral_Centroid | Spectral_Bandwidth | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | hw1_q2_610399205_male.mp3_0 | male | 610399205 | -94.829977 | 122.175083 | -68.000374 | 47.507958 | -14.407363 | -13.427573 | -23.064053 | ... | 16.537672 | 17.337876 | 15.638074 | 18.102693 | 23.265461 | 66.736837 | 0.105971 | [0.24632083] | 1911.823347 | 1581.462113 |
| 1 | hw1_q2_610399205_male.mp3_1 | male | 610399205 | -123.197408 | 131.659141 | -61.479495 | 44.792372 | -7.927916 | -10.119547 | -21.518692 | ... | 16.173748 | 17.615650 | 15.370934 | 17.944107 | 21.722387 | 64.664952 | 0.079931 | [0.21608743] | 1656.721908 | 1499.382229 |
| 2 | hw1_q2_610399205_male.mp3_2 | male | 610399205 | -122.910702 | 141.014834 | -67.712628 | 39.928565 | -14.785694 | -7.251495 | -24.902940 | ... | 15.759529 | 17.226226 | 15.636203 | 18.222485 | 21.671965 | 63.307834 | 0.096313 | [0.19001988] | 1707.593297 | 1448.149975 |
| 3 | hw1_q2_610399205_male.mp3_3 | male | 610399205 | -140.025639 | 126.584783 | -53.874062 | 40.963797 | -5.784167 | -5.752144 | -22.515332 | ... | 15.341300 | 15.522699 | 14.462609 | 18.687998 | 22.817218 | 61.284887 | 0.099141 | [0.16602059] | 1857.183068 | 1591.809683 |
| 4 | hw1_q2_610399205_male.mp3_4 | male | 610399205 | -127.307332 | 122.204455 | -61.316071 | 54.882303 | -0.544779 | -11.455723 | -28.317545 | ... | 17.254408 | 17.243651 | 14.808253 | 19.110663 | 23.740752 | 67.221951 | 0.096767 | [0.21153131] | 1830.105791 | 1537.769633 |
5 rows × 27 columns
list(data.columns)
['filename', 'label', 'student_id', 'Mfcc_1', 'Mfcc_2', 'Mfcc_3', 'Mfcc_4', 'Mfcc_5', 'Mfcc_6', 'Mfcc_7', 'Mfcc_8', 'Mfcc_9', 'Mfcc_10', 'Mfcc_11', 'Mfcc_12', 'Mfcc_13', 'Spectral_Contrast_1', 'Spectral_Contrast_2', 'Spectral_Contrast_3', 'Spectral_Contrast_4', 'Spectral_Contrast_5', 'Spectral_Contrast_6', 'Spectral_Contrast_7', 'Zero_Crossing_Rate', 'RMS_Energy', 'Spectral_Centroid', 'Spectral_Bandwidth']
Initially, the values in the RMS_Energy column are stored as strings wrapped in square brackets (e.g. [0.24632083]). This format is not suitable for numerical analysis, so we need to clean the strings and convert them to floats.
data['RMS_Energy'] = (
    data['RMS_Energy']
    .astype(str)
    .str.replace('[', '', regex=False)
    .str.replace(']', '', regex=False)
    .astype(float)
)
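As a quick sanity check, the same cleaning chain can be exercised on a small toy Series (hypothetical values mirroring the bracketed format above); `astype(str)` first means entries that are already numeric survive the replace chain unchanged:

```python
import pandas as pd

# Hypothetical values in the same bracketed-string format as the raw column
raw = pd.Series(['[0.24632083]', '[0.21608743]', 0.19001988])

clean = (raw.astype(str)
            .str.replace('[', '', regex=False)
            .str.replace(']', '', regex=False)
            .astype(float))
print(clean.tolist())  # [0.24632083, 0.21608743, 0.19001988]
```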
k_fold_cross_validation Function¶
This function provides an efficient way to compare classification models and evaluate various features and feature reduction techniques.
Using the KFold class from sklearn.model_selection, the data is split into k folds (default is 5). This ensures that each fold is used as a test set once, and as part of the training set k-1 times.
Standard scaling is applied to the feature sets using StandardScaler to normalize the data.
If a feature reduction function is provided, we fit it using the training data and then apply it to the test data. This step ensures that the feature reduction technique is appropriately trained on the training set before being applied to unseen test data.
The average accuracy score across all folds is calculated to provide an overall performance metric. Additionally, the average confusion matrix is plotted to visualize the performance.
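To illustrate the fold mechanics described above, here is a minimal sketch on synthetic data (20 samples, k=5 as in the default): every sample appears in a test set exactly once across the folds.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)  # 20 synthetic samples
kf = KFold(n_splits=5, shuffle=True, random_state=42)

seen_as_test = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Each fold: 16 training samples, 4 test samples
    print(f"Fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
    seen_as_test.extend(test_idx)

# Every sample lands in the test set exactly once across the 5 folds
assert sorted(seen_as_test) == list(range(20))
```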
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
def k_fold_cross_validation(df, feature_names, target_name, model, feature_reduction_func, k=5):
    """
    Perform k-fold cross-validation on the given data.

    Parameters:
        df (pd.DataFrame): The input data frame.
        feature_names (list): List of feature column names.
        target_name (str): The name of the target column.
        model: The classification model to be used.
        feature_reduction_func: The feature reduction transformer to be applied (or None).
        k (int): The number of folds for cross-validation (default is 5).

    Returns:
        tuple: The average and standard deviation of the accuracy (in percent) across all folds.
    """
    X = df[feature_names]
    y = df[target_name]
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    accuracies = []
    confusion_matrices = []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        # Feature scaling (fit on the training fold only)
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        # Optional feature reduction (fit on the training fold only)
        if feature_reduction_func:
            X_train_scaled = feature_reduction_func.fit_transform(X_train_scaled, y_train)
            X_test_scaled = feature_reduction_func.transform(X_test_scaled)
        model.fit(X_train_scaled, y_train)
        # Make predictions and score the fold
        y_pred = model.predict(X_test_scaled)
        accuracy = accuracy_score(y_test, y_pred)
        accuracies.append(accuracy)
        # Accumulate the fold's confusion matrix
        conf_matrix = confusion_matrix(y_test, y_pred)
        confusion_matrices.append(conf_matrix)
    # Average accuracy and its spread across folds, in percent
    avg_accuracy = np.mean(accuracies) * 100
    std_accuracy = np.std(accuracies) * 100
    # Plot the average confusion matrix
    avg_conf_matrix = np.mean(confusion_matrices, axis=0)
    plt.figure(figsize=(5, 3))
    sns.heatmap(avg_conf_matrix, annot=True, fmt='.2f', cmap='Blues')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Average Confusion Matrix')
    plt.show()
    return avg_accuracy, std_accuracy
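The scale-then-reduce-then-fit pattern can also be expressed with scikit-learn's Pipeline and cross_val_score, which fit each step on the training fold only and so give the same leakage-safe behavior automatically. A minimal sketch on synthetic data (not the notebook's dataset):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.datasets import make_classification

# Synthetic stand-in for the audio features (600 samples, 6 classes)
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=6, random_state=42)

# Each CV fold fits the scaler and LDA on the training split only,
# mirroring the manual fit_transform/transform logic above
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce', LinearDiscriminantAnalysis()),
    ('clf', KNeighborsClassifier(n_neighbors=5)),
])
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv)
print(f"mean accuracy: {scores.mean()*100:.2f}% (+/- {scores.std()*100:.2f}%)")
```

A Pipeline is also convenient because swapping the reduction step (LDA, PCA, or dropping it) only changes one entry in the step list.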
Visualization of Cross-Validation Results¶
Plot_Results_Barchart¶
To effectively compare the performance of various classification models and feature reduction techniques, we utilize a bar chart visualization. The Plot_Results_Barchart function is designed to generate and display these bar charts, providing clear and intuitive insights into the cross-validation results.
For each feature reduction method, the function filters the results and plots a bar chart showing the mean accuracy of different models for each feature set. The maximum accuracy for each feature reduction method is printed, and bars with the highest accuracy are highlighted with red edges.
def Plot_Results_Barchart(results_df):
    # List of feature reduction methods
    feature_reductions = results_df['Feature Reduction'].unique()
    # Set a pleasing color palette
    sns.set_palette("Set2")
    # Plot a bar chart for each feature reduction method
    for reduction in feature_reductions:
        # Filter data for the current feature reduction method
        filtered_results = results_df[results_df['Feature Reduction'] == reduction]
        max_value = filtered_results['Mean Accuracy'].max()
        print(f"Maximum accuracy for feature reduction({reduction}): {max_value:.2f}%")
        # Plot the bar chart
        plt.figure(figsize=(7, 5))
        ax = sns.barplot(x='Feature Set', y='Mean Accuracy', hue='Model',
                         data=filtered_results, errorbar=None, width=0.4)
        # Outline each bar; highlight the best-performing one in red
        for bar in ax.patches:
            if bar.get_height() == max_value:
                bar.set_edgecolor('red')
                bar.set_linewidth(2)
            else:
                bar.set_edgecolor('black')
                bar.set_linewidth(0.5)
        # Set titles and labels
        plt.title(f'Cross-Validation Mean Accuracy for Different Models ({reduction})', fontsize=10)
        plt.xlabel('Feature Set', fontsize=8)
        plt.ylabel('Mean Accuracy', fontsize=8)
        plt.ylim(0, 100)
        plt.xticks(fontsize=8)
        plt.yticks(fontsize=8)
        plt.legend(title='Model', loc='upper center', bbox_to_anchor=(0.5, -0.2), ncol=3, fontsize=8)
        plt.tight_layout()
        plt.show()
Classification Using Different Models, Feature Reduction, and Feature Sets¶
Run_Classification¶
The Run_Classification function evaluates the performance of various classification models using different feature sets and feature reduction techniques. It leverages k-fold cross-validation (k = 4) to provide robust performance metrics for each combination of model, feature set, and feature reduction method.
Various classification models are defined for evaluation:
- K-Nearest Neighbors (KNN)
- Logistic Regression
- Support Vector Machine (SVM)
Feature Reduction Techniques:
- Linear Discriminant Analysis (LDA)
- Principal Component Analysis (PCA)
- No feature reduction (None)
Defining Feature Sets:
- mfcc_features: Mel-Frequency Cepstral Coefficients (MFCC) features.
- spectral_contrast_features: Spectral Contrast features.
- combined_features: A combination of MFCC and Spectral Contrast features.
- time_features: Time-domain features such as Zero Crossing Rate and RMS Energy.
- Spec_cent_Bw: Spectral Centroid and Spectral Bandwidth features.
- all_features: A combination of all the above feature sets.
The function returns a results DataFrame containing the performance metrics for each combination.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
def Run_Classification(balanced_dataFrame):
    # Define the feature sets
    mfcc_features = [f'Mfcc_{i+1}' for i in range(13)]
    spectral_contrast_features = [f'Spectral_Contrast_{i+1}' for i in range(7)]
    combined_features = mfcc_features + spectral_contrast_features
    time_features = ['Zero_Crossing_Rate', 'RMS_Energy']
    Spec_cent_Bw = ['Spectral_Centroid', 'Spectral_Bandwidth']
    all_features = combined_features + time_features + Spec_cent_Bw
    # Clean RMS_Energy in case the column still holds bracketed strings
    balanced_dataFrame['RMS_Energy'] = (
        balanced_dataFrame['RMS_Energy']
        .astype(str)
        .str.replace('[', '', regex=False)
        .str.replace(']', '', regex=False)
        .astype(float)
    )
    # Define the target column
    target_column = 'class'
    # Define the models
    models = {
        'KNN': KNeighborsClassifier(n_neighbors=5),
        'Logistic Regression': LogisticRegression(max_iter=200, random_state=42),
        'SVM': SVC(kernel='linear', random_state=42)
    }
    # Name the feature sets
    feature_sets = {
        'time_domain_features': time_features,
        'MFCC': mfcc_features,
        'Spectral Contrast': spectral_contrast_features,
        'MFCC&Sp_Contrast': combined_features,
        'Spec_Cent&BW': Spec_cent_Bw,
        'all_features': all_features
    }
    feature_reductions = {
        'LDA': LinearDiscriminantAnalysis(),
        'PCA': PCA(),
        'None': None
    }
    # Run cross-validation and store results
    results = []
    for model_name, model in models.items():
        for feature_set_name, feature_columns in feature_sets.items():
            for reduction_name, reduction_func in feature_reductions.items():
                print(f'\nModel({model_name}) using feature({feature_set_name}) and feature reduction({reduction_name})\n')
                mean_accuracy, std_accuracy = k_fold_cross_validation(
                    balanced_dataFrame, feature_columns, target_column,
                    model, reduction_func, k=4)
                results.append({
                    'Model': model_name,
                    'Feature Set': feature_set_name,
                    'Feature Reduction': reduction_name,
                    'Mean Accuracy': mean_accuracy,
                    'Std Accuracy': std_accuracy
                })
    results_df = pd.DataFrame(results)
    return results_df
Random Student Set1¶
Select 6 random students, ensure each student has an equal number of samples, and visualize some features using pairplots.
Random Student Selection: We begin by randomly selecting 6 students from the dataset. This ensures that our analysis is not biased towards any specific subset of students.
import random
random.seed(42)
# Get unique student IDs
student_ids = data['student_id'].unique()
# Randomly select 6 students
selected_students = random.sample(list(student_ids), 6)
print(f"Selected Students: {selected_students}")
Selected Students: [np.int64(810103183), np.int64(810100091), np.int64(810198554), np.int64(810103317), np.int64(810199489), np.int64(810101465)]
Equal Number of Samples: For each of the selected students, we ensure that an equal number of samples are used for further analysis. This step is crucial to maintain consistency and fairness in our analysis.
from sklearn.utils import shuffle
filtered_data = data[data['student_id'].isin(selected_students)]
# Ensure each student has an equal number of samples
balanced_data = filtered_data.groupby('student_id').apply(lambda x: x.sample(n=filtered_data['student_id'].value_counts().min(), random_state=42)).reset_index(drop=True)
balanced_data = shuffle(balanced_data, random_state=42)
balanced_data.reset_index(drop=True, inplace=True)
print("Number of samples for each student in the balanced DataFrame:")
print(balanced_data['student_id'].value_counts())
Number of samples for each student in the balanced DataFrame:
student_id
810103183    123
810101465    123
810198554    123
810199489    123
810103317    123
810100091    123
Name: count, dtype: int64
/tmp/ipykernel_15412/3536119780.py:6: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
balanced_data = filtered_data.groupby('student_id').apply(lambda x: x.sample(n=filtered_data['student_id'].value_counts().min(), random_state=42)).reset_index(drop=True)
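The deprecation warning above can be avoided with DataFrameGroupBy.sample (pandas >= 1.1), which draws the per-group samples directly instead of going through apply. A sketch on a hypothetical toy frame with unequal group sizes:

```python
import pandas as pd

# Hypothetical toy frame: groups of size 5, 3, and 4
df = pd.DataFrame({'student_id': [1]*5 + [2]*3 + [3]*4,
                   'value': range(12)})

# Downsample every group to the size of the smallest one
n_min = df['student_id'].value_counts().min()  # 3
balanced = (df.groupby('student_id')
              .sample(n=n_min, random_state=42)
              .reset_index(drop=True))
print(balanced['student_id'].value_counts().tolist())  # [3, 3, 3]
```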
Pairplot Visualization:¶
Pairplots are used to visualize the relationships between different features for the selected students. This provides an intuitive and visual understanding of the data distribution and correlations.
MFCC features
import seaborn as sns
mfcc_features = [f'Mfcc_{i+1}' for i in range(13)]
# Plot pair plot
palette = sns.color_palette("tab10", balanced_data['student_id'].nunique())
sns.pairplot(balanced_data, hue='student_id', vars= mfcc_features, palette=palette)
# Display the plot
plt.show()
Spectral_Contrast features
import seaborn as sns
spectral_contrast_features = [f'Spectral_Contrast_{i+1}' for i in range(7)]
# Plot pair plot
palette = sns.color_palette("tab10", balanced_data['student_id'].nunique())
sns.pairplot(balanced_data, hue='student_id', vars= spectral_contrast_features, palette=palette)
# Display the plot
plt.show()
Time domain features: Zero Crossing Rate and RMS Energy
import seaborn as sns
time_features = ['Zero_Crossing_Rate', 'RMS_Energy']
# Plot pair plot
palette = sns.color_palette("tab10", balanced_data['student_id'].nunique())
sns.pairplot(balanced_data, hue='student_id', vars= time_features, palette=palette)
# Display the plot
plt.show()
Classification and Analysis for Student Set1¶
Mapping Student IDs to Class Labels: We create a mapping from student IDs to class labels. Each selected student is assigned a unique class label, starting from 0 up to the number of selected students minus one.
# Map student IDs to class labels (e.g., 0 to 5)
class_mapping = {student_id: idx for idx, student_id in enumerate(selected_students)}
balanced_data['class'] = balanced_data['student_id'].map(class_mapping)
print("Balanced DataFrame with Class Column:")
balanced_data.tail()
Balanced DataFrame with Class Column:
| filename | label | student_id | Mfcc_1 | Mfcc_2 | Mfcc_3 | Mfcc_4 | Mfcc_5 | Mfcc_6 | Mfcc_7 | ... | Spectral_Contrast_3 | Spectral_Contrast_4 | Spectral_Contrast_5 | Spectral_Contrast_6 | Spectral_Contrast_7 | Zero_Crossing_Rate | RMS_Energy | Spectral_Centroid | Spectral_Bandwidth | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 733 | hw1_q1_810100091_male.mp3_14 | male | 810100091 | -209.270224 | 159.299467 | -12.862626 | 38.995735 | -8.071367 | 1.210875 | -26.295211 | ... | 18.579838 | 16.371176 | 18.578580 | 18.729660 | 57.705217 | 0.045901 | 0.146539 | 1065.860536 | 1343.465670 | 1 |
| 734 | hw1_q4_810100091_male.mp3_35 | male | 810100091 | -180.091125 | 158.623054 | -8.343020 | 38.221878 | 0.616179 | 6.856343 | -26.354627 | ... | 17.178651 | 14.908284 | 17.176830 | 17.841851 | 58.990186 | 0.047214 | 0.206644 | 1160.097436 | 1424.149759 | 1 |
| 735 | hw1_q4_810103183_male.mp3_2 | male | 810103183 | -197.288841 | 153.435223 | -21.336439 | 39.097099 | -28.892864 | -4.788617 | -17.154573 | ... | 18.552319 | 16.887290 | 18.161144 | 22.510156 | 58.677515 | 0.074786 | 0.160438 | 1388.911960 | 1346.191071 | 0 |
| 736 | hw1_q4_810103317_male.mp3_27 | male | 810103317 | -196.783701 | 156.871713 | 2.812224 | 30.233925 | 4.374440 | -1.058855 | -26.090089 | ... | 20.248830 | 16.543377 | 19.366785 | 18.958355 | 58.715747 | 0.054211 | 0.242592 | 1063.924159 | 1260.229348 | 3 |
| 737 | hw1_q1_810100091_male.mp3_9 | male | 810100091 | -161.626642 | 153.208495 | -11.947535 | 47.287845 | -4.489833 | -2.877725 | -31.232691 | ... | 19.821527 | 17.225590 | 17.476504 | 18.599111 | 60.334727 | 0.049144 | 0.271515 | 1141.883666 | 1375.061913 | 1 |
5 rows × 28 columns
Classification Process and Average Confusion Matrix¶
Calling the Classification Function¶
The Run_Classification function is called with the balanced dataset (balanced_data) as its input. It performs k-fold cross-validation for every combination of classification model, feature set, and feature reduction technique, and the resulting Result_DF DataFrame holds the corresponding performance metrics. This structured summary facilitates easy comparison and selection of the most effective approach for the classification task.
Result_DF = Run_Classification(balanced_data)
Model(KNN) using feature(time_domain_features) and feature reduction(LDA)
Model(KNN) using feature(time_domain_features) and feature reduction(PCA)
Model(KNN) using feature(time_domain_features) and feature reduction(None)
Model(KNN) using feature(MFCC) and feature reduction(LDA)
Model(KNN) using feature(MFCC) and feature reduction(PCA)
Model(KNN) using feature(MFCC) and feature reduction(None)
Model(KNN) using feature(Spectral Contrast) and feature reduction(LDA)
Model(KNN) using feature(Spectral Contrast) and feature reduction(PCA)
Model(KNN) using feature(Spectral Contrast) and feature reduction(None)
Model(KNN) using feature(MFCC&Sp_Contrast) and feature reduction(LDA)
Model(KNN) using feature(MFCC&Sp_Contrast) and feature reduction(PCA)
Model(KNN) using feature(MFCC&Sp_Contrast) and feature reduction(None)
Model(KNN) using feature(Spec_Cent&BW) and feature reduction(LDA)
Model(KNN) using feature(Spec_Cent&BW) and feature reduction(PCA)
Model(KNN) using feature(Spec_Cent&BW) and feature reduction(None)
Model(KNN) using feature(all_features) and feature reduction(LDA)
Model(KNN) using feature(all_features) and feature reduction(PCA)
Model(KNN) using feature(all_features) and feature reduction(None)
Model(Logistic Regression) using feature(time_domain_features) and feature reduction(LDA)
Model(Logistic Regression) using feature(time_domain_features) and feature reduction(PCA)
Model(Logistic Regression) using feature(time_domain_features) and feature reduction(None)
Model(Logistic Regression) using feature(MFCC) and feature reduction(LDA)
Model(Logistic Regression) using feature(MFCC) and feature reduction(PCA)
Model(Logistic Regression) using feature(MFCC) and feature reduction(None)
Model(Logistic Regression) using feature(Spectral Contrast) and feature reduction(LDA)
Model(Logistic Regression) using feature(Spectral Contrast) and feature reduction(PCA)
Model(Logistic Regression) using feature(Spectral Contrast) and feature reduction(None)
Model(Logistic Regression) using feature(MFCC&Sp_Contrast) and feature reduction(LDA)
Model(Logistic Regression) using feature(MFCC&Sp_Contrast) and feature reduction(PCA)
Model(Logistic Regression) using feature(MFCC&Sp_Contrast) and feature reduction(None)
Model(Logistic Regression) using feature(Spec_Cent&BW) and feature reduction(LDA)
Model(Logistic Regression) using feature(Spec_Cent&BW) and feature reduction(PCA)
Model(Logistic Regression) using feature(Spec_Cent&BW) and feature reduction(None)
Model(Logistic Regression) using feature(all_features) and feature reduction(LDA)
Model(Logistic Regression) using feature(all_features) and feature reduction(PCA)
Model(Logistic Regression) using feature(all_features) and feature reduction(None)
Model(SVM) using feature(time_domain_features) and feature reduction(LDA)
Model(SVM) using feature(time_domain_features) and feature reduction(PCA)
Model(SVM) using feature(time_domain_features) and feature reduction(None)
Model(SVM) using feature(MFCC) and feature reduction(LDA)
Model(SVM) using feature(MFCC) and feature reduction(PCA)
Model(SVM) using feature(MFCC) and feature reduction(None)
Model(SVM) using feature(Spectral Contrast) and feature reduction(LDA)
Model(SVM) using feature(Spectral Contrast) and feature reduction(PCA)
Model(SVM) using feature(Spectral Contrast) and feature reduction(None)
Model(SVM) using feature(MFCC&Sp_Contrast) and feature reduction(LDA)
Model(SVM) using feature(MFCC&Sp_Contrast) and feature reduction(PCA)
Model(SVM) using feature(MFCC&Sp_Contrast) and feature reduction(None)
Model(SVM) using feature(Spec_Cent&BW) and feature reduction(LDA)
Model(SVM) using feature(Spec_Cent&BW) and feature reduction(PCA)
Model(SVM) using feature(Spec_Cent&BW) and feature reduction(None)
Model(SVM) using feature(all_features) and feature reduction(LDA)
Model(SVM) using feature(all_features) and feature reduction(PCA)
Model(SVM) using feature(all_features) and feature reduction(None)
Sorting and Displaying Top Results¶
By sorting the results DataFrame by mean accuracy and displaying the top rows, we efficiently identify the best-performing combinations of models, feature sets, and feature reduction techniques.
df_sorted = Result_DF.sort_values(by='Mean Accuracy', ascending=False)
df_sorted.head()
| Model | Feature Set | Feature Reduction | Mean Accuracy | Std Accuracy | |
|---|---|---|---|---|---|
| 15 | KNN | all_features | LDA | 99.729730 | 0.468122 |
| 34 | Logistic Regression | all_features | PCA | 99.459459 | 0.540541 |
| 51 | SVM | all_features | LDA | 99.459459 | 0.540541 |
| 33 | Logistic Regression | all_features | LDA | 99.459459 | 0.540541 |
| 35 | Logistic Regression | all_features | None | 99.459459 | 0.540541 |
Plotting Results¶
The line Plot_Results_Barchart(Result_DF) calls the Plot_Results_Barchart function and passes the Result_DF DataFrame as an argument. This function generates and displays bar charts that visually compare the performance of various classification models and feature reduction techniques based on the cross-validation results stored in Result_DF.
Plot_Results_Barchart(Result_DF)
Maximum accuracy for feature reduction(LDA): 99.73%
Maximum accuracy for feature reduction(PCA): 99.46%
Maximum accuracy for feature reduction(None): 99.46%
Random Student Set2¶
import random
random.seed(8)
# Get unique student IDs
student_ids = data['student_id'].unique()
# Randomly select 6 students
selected_students = random.sample(list(student_ids), 6)
print(f"Selected Students: {selected_students}")
Selected Students: [np.int64(810102148), np.int64(810100193), np.int64(810600133), np.int64(810103054), np.int64(810100206), np.int64(810100168)]
from sklearn.utils import shuffle
filtered_data = data[data['student_id'].isin(selected_students)]
# Ensure each student has an equal number of samples
balanced_data = filtered_data.groupby('student_id').apply(lambda x: x.sample(n=filtered_data['student_id'].value_counts().min(), random_state=42)).reset_index(drop=True)
balanced_data = shuffle(balanced_data, random_state=42)
balanced_data.reset_index(drop=True, inplace=True)
print("Number of samples for each student in the balanced DataFrame:")
print(balanced_data['student_id'].value_counts())
Number of samples for each student in the balanced DataFrame:
student_id
810600133    97
810100206    97
810100193    97
810103054    97
810102148    97
810100168    97
Name: count, dtype: int64
/tmp/ipykernel_15412/3536119780.py:6: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
balanced_data = filtered_data.groupby('student_id').apply(lambda x: x.sample(n=filtered_data['student_id'].value_counts().min(), random_state=42)).reset_index(drop=True)
Classification and Analysis for Student Set2¶
# Map student IDs to class labels (e.g., 0 to 5)
class_mapping = {student_id: idx for idx, student_id in enumerate(selected_students)}
balanced_data['class'] = balanced_data['student_id'].map(class_mapping)
print("Balanced DataFrame with Class Column:")
balanced_data.tail()
Balanced DataFrame with Class Column:
| filename | label | student_id | Mfcc_1 | Mfcc_2 | Mfcc_3 | Mfcc_4 | Mfcc_5 | Mfcc_6 | Mfcc_7 | ... | Spectral_Contrast_3 | Spectral_Contrast_4 | Spectral_Contrast_5 | Spectral_Contrast_6 | Spectral_Contrast_7 | Zero_Crossing_Rate | RMS_Energy | Spectral_Centroid | Spectral_Bandwidth | class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 577 | hw1_q6_810100168_female.mp3_56 | female | 810100168 | -211.851905 | 132.622349 | -26.187861 | 7.499521 | -15.446996 | -16.373846 | -20.007774 | ... | 18.901962 | 18.813562 | 19.578638 | 19.716234 | 57.092843 | 0.086897 | 0.141222 | 1430.544644 | 1292.170410 | 5 |
| 578 | hw1_q2_810100193_female.mp3_9 | female | 810100193 | -184.267192 | 124.399497 | 5.529370 | 41.731653 | -26.419279 | -8.033805 | -17.266343 | ... | 15.281587 | 16.121414 | 16.699920 | 16.846284 | 65.358917 | 0.055890 | 0.229206 | 1321.779222 | 1616.504355 | 1 |
| 579 | hw1_q6_810100206_male.mp3_78 | male | 810100206 | -264.700924 | 180.204484 | 16.873536 | 30.587170 | -10.826312 | -5.214259 | -29.595162 | ... | 18.114748 | 16.465813 | 17.164280 | 16.028635 | 51.030329 | 0.051165 | 0.117173 | 893.509100 | 1109.506454 | 4 |
| 580 | hw1_q6_810103054_male.mp3_22 | male | 810103054 | -223.577594 | 137.644504 | -7.995220 | 38.862804 | -18.405657 | 14.722776 | -18.204598 | ... | 19.636023 | 14.571491 | 18.150354 | 18.250443 | 59.959546 | 0.041126 | 0.132037 | 1133.692981 | 1511.824107 | 3 |
| 581 | hw1_q3_810100193_female.mp3_17 | female | 810100193 | -177.964937 | 146.832626 | 2.405349 | 40.226782 | -23.222275 | -23.508371 | -36.809802 | ... | 19.449307 | 18.994886 | 17.612312 | 17.172073 | 59.866017 | 0.049638 | 0.214573 | 1117.108282 | 1373.884490 | 1 |
5 rows × 28 columns
Classification Process and Average Confusion Matrix¶
Calling the Classification Function¶
Result_DF = Run_Classification(balanced_data)
Model(KNN) using feature(time_domain_features) and feature reduction(LDA)
Model(KNN) using feature(time_domain_features) and feature reduction(PCA)
Model(KNN) using feature(time_domain_features) and feature reduction(None)
Model(KNN) using feature(MFCC) and feature reduction(LDA)
Model(KNN) using feature(MFCC) and feature reduction(PCA)
Model(KNN) using feature(MFCC) and feature reduction(None)
Model(KNN) using feature(Spectral Contrast) and feature reduction(LDA)
Model(KNN) using feature(Spectral Contrast) and feature reduction(PCA)
Model(KNN) using feature(Spectral Contrast) and feature reduction(None)
Model(KNN) using feature(MFCC&Sp_Contrast) and feature reduction(LDA)
Model(KNN) using feature(MFCC&Sp_Contrast) and feature reduction(PCA)
Model(KNN) using feature(MFCC&Sp_Contrast) and feature reduction(None)
Model(KNN) using feature(Spec_Cent&BW) and feature reduction(LDA)
Model(KNN) using feature(Spec_Cent&BW) and feature reduction(PCA)
Model(KNN) using feature(Spec_Cent&BW) and feature reduction(None)
Model(KNN) using feature(all_features) and feature reduction(LDA)
Model(KNN) using feature(all_features) and feature reduction(PCA)
Model(KNN) using feature(all_features) and feature reduction(None)
Model(Logistic Regression) using feature(time_domain_features) and feature reduction(LDA)
Model(Logistic Regression) using feature(time_domain_features) and feature reduction(PCA)
Model(Logistic Regression) using feature(time_domain_features) and feature reduction(None)
Model(Logistic Regression) using feature(MFCC) and feature reduction(LDA)
Model(Logistic Regression) using feature(MFCC) and feature reduction(PCA)
Model(Logistic Regression) using feature(MFCC) and feature reduction(None)
Model(Logistic Regression) using feature(Spectral Contrast) and feature reduction(LDA)
Model(Logistic Regression) using feature(Spectral Contrast) and feature reduction(PCA)
Model(Logistic Regression) using feature(Spectral Contrast) and feature reduction(None)
Model(Logistic Regression) using feature(MFCC&Sp_Contrast) and feature reduction(LDA)
Model(Logistic Regression) using feature(MFCC&Sp_Contrast) and feature reduction(PCA)
Model(Logistic Regression) using feature(MFCC&Sp_Contrast) and feature reduction(None)
Model(Logistic Regression) using feature(Spec_Cent&BW) and feature reduction(LDA)
Model(Logistic Regression) using feature(Spec_Cent&BW) and feature reduction(PCA)
Model(Logistic Regression) using feature(Spec_Cent&BW) and feature reduction(None)
Model(Logistic Regression) using feature(all_features) and feature reduction(LDA)
Model(Logistic Regression) using feature(all_features) and feature reduction(PCA)
Model(Logistic Regression) using feature(all_features) and feature reduction(None)
Model(SVM) using feature(time_domain_features) and feature reduction(LDA)
Model(SVM) using feature(time_domain_features) and feature reduction(PCA)
Model(SVM) using feature(time_domain_features) and feature reduction(None)
Model(SVM) using feature(MFCC) and feature reduction(LDA)
Model(SVM) using feature(MFCC) and feature reduction(PCA)
Model(SVM) using feature(MFCC) and feature reduction(None)
Model(SVM) using feature(Spectral Contrast) and feature reduction(LDA)
Model(SVM) using feature(Spectral Contrast) and feature reduction(PCA)
Model(SVM) using feature(Spectral Contrast) and feature reduction(None)
Model(SVM) using feature(MFCC&Sp_Contrast) and feature reduction(LDA)
Model(SVM) using feature(MFCC&Sp_Contrast) and feature reduction(PCA)
Model(SVM) using feature(MFCC&Sp_Contrast) and feature reduction(None)
Model(SVM) using feature(Spec_Cent&BW) and feature reduction(LDA)
Model(SVM) using feature(Spec_Cent&BW) and feature reduction(PCA)
Model(SVM) using feature(Spec_Cent&BW) and feature reduction(None)
Model(SVM) using feature(all_features) and feature reduction(LDA)
Model(SVM) using feature(all_features) and feature reduction(PCA)
Model(SVM) using feature(all_features) and feature reduction(None)
Sorting and Displaying Top Results¶
df_sorted = Result_DF.sort_values(by='Mean Accuracy', ascending=False)
df_sorted.head()
| Model | Feature Set | Feature Reduction | Mean Accuracy | Std Accuracy | |
|---|---|---|---|---|---|
| 47 | SVM | MFCC&Sp_Contrast | None | 99.655172 | 0.344828 |
| 46 | SVM | MFCC&Sp_Contrast | PCA | 99.655172 | 0.344828 |
| 53 | SVM | all_features | None | 99.485120 | 0.297272 |
| 52 | SVM | all_features | PCA | 99.485120 | 0.297272 |
| 10 | KNN | MFCC&Sp_Contrast | PCA | 99.312707 | 0.002362 |
Plotting Results¶
Plot_Results_Barchart(Result_DF)
Maximum accuracy for feature reduction(LDA): 98.97%
Maximum accuracy for feature reduction(PCA): 99.66%
Maximum accuracy for feature reduction(None): 99.66%
Random Student Set3¶
import random
random.seed(1403)
# Get unique student IDs
student_ids = data['student_id'].unique()
# Randomly select 6 students
selected_students = random.sample(list(student_ids), 6)
print(f"Selected Students: {selected_students}")
Selected Students: [np.int64(810101401), np.int64(610300070), np.int64(810103054), np.int64(810100261), np.int64(810103317), np.int64(810101456)]
from sklearn.utils import shuffle
filtered_data = data[data['student_id'].isin(selected_students)]
# Ensure each student has an equal number of samples
balanced_data = filtered_data.groupby('student_id').apply(lambda x: x.sample(n=filtered_data['student_id'].value_counts().min(), random_state=42)).reset_index(drop=True)
balanced_data = shuffle(balanced_data, random_state=42)
balanced_data.reset_index(drop=True, inplace=True)
print("Number of samples for each student in the balanced DataFrame:")
print(balanced_data['student_id'].value_counts())
Number of samples for each student in the balanced DataFrame:
student_id
810103054    139
810103317    139
810101401    139
810100261    139
610300070    139
810101456    139
Name: count, dtype: int64
/tmp/ipykernel_15412/3536119780.py:6: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
balanced_data = filtered_data.groupby('student_id').apply(lambda x: x.sample(n=filtered_data['student_id'].value_counts().min(), random_state=42)).reset_index(drop=True)
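The `DeprecationWarning` above comes from `groupby(...).apply(...)` operating on the grouping column. A warning-free equivalent (a sketch of the same balancing logic, with a toy stand-in for `filtered_data`) uses `DataFrameGroupBy.sample`, which keeps `student_id` as an ordinary column:

```python
import pandas as pd

# Toy stand-in for filtered_data: three students with unequal sample counts
filtered_data = pd.DataFrame({
    'student_id': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Mfcc_1': range(9),
})

# Downsample every student to the size of the smallest group
n_min = filtered_data['student_id'].value_counts().min()
balanced = (filtered_data
            .groupby('student_id')
            .sample(n=n_min, random_state=42)
            .reset_index(drop=True))
print(balanced['student_id'].value_counts())
```

`groupby(...).sample(...)` never applies a function over the grouping column, so it emits no warning and produces the same per-student counts.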
Classification and Analysis for Student Set 3¶
# Map student IDs to class labels (e.g., 0 to 5)
class_mapping = {student_id: idx for idx, student_id in enumerate(selected_students)}
balanced_data['class'] = balanced_data['student_id'].map(class_mapping)
print("Balanced DataFrame with Class Column:")
balanced_data.tail()
Balanced DataFrame with Class Column:
| | filename | label | student_id | Mfcc_1 | Mfcc_2 | Mfcc_3 | Mfcc_4 | Mfcc_5 | Mfcc_6 | Mfcc_7 | ... | Spectral_Contrast_3 | Spectral_Contrast_4 | Spectral_Contrast_5 | Spectral_Contrast_6 | Spectral_Contrast_7 | Zero_Crossing_Rate | RMS_Energy | Spectral_Centroid | Spectral_Bandwidth | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 829 | hw1_q1_610300070_female.mp3_78 | female | 610300070 | -175.235198 | 102.698340 | -29.725138 | 42.060783 | -30.536303 | -2.644054 | -35.779279 | ... | 22.399559 | 18.455236 | 19.463346 | 21.840109 | 63.809065 | 0.081916 | 0.253871 | 1622.742829 | 1600.531925 | 1 |
| 830 | hw1_q5_610300070_female.mp3_3 | female | 610300070 | -228.418644 | 113.546638 | -1.372091 | 31.949328 | -20.382214 | -5.201409 | -15.191473 | ... | 20.273068 | 16.249352 | 17.684594 | 22.089427 | 63.913752 | 0.061365 | 0.159174 | 1363.107072 | 1594.767166 | 1 |
| 831 | hw1_q1_810100261_male.mp3.mp3_1 | male | 810100261 | -210.533109 | 107.467373 | -34.426835 | 39.903486 | 1.809619 | -4.967549 | -22.326968 | ... | 21.508174 | 19.111921 | 21.422158 | 26.465331 | 62.774750 | 0.108192 | 0.148817 | 1925.854474 | 1603.347033 | 3 |
| 832 | hw1_q1_810101456_female.mp3_98 | female | 810101456 | -202.050703 | 95.903137 | -48.087167 | 19.064818 | -20.076070 | -17.911852 | -34.061431 | ... | 24.042378 | 20.284681 | 22.179613 | 20.614573 | 65.130201 | 0.077171 | 0.185471 | 1654.069415 | 1611.754747 | 5 |
| 833 | hw1_q1_610300070_female.mp3_18 | female | 610300070 | -223.128404 | 105.228311 | -8.214977 | 41.257315 | -31.044399 | 4.772623 | -33.162689 | ... | 20.590256 | 16.414522 | 19.069961 | 19.842611 | 62.274789 | 0.078410 | 0.159881 | 1593.538096 | 1701.605259 | 1 |
5 rows × 28 columns
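To read predictions back as student IDs later (for example when labeling a confusion matrix), the class mapping can simply be inverted. A minimal sketch, assuming `selected_students` and `class_mapping` are built as above (the IDs here are just three of the drawn values):

```python
# Hypothetical subset of selected students; in the notebook these come from the random draw
selected_students = [810101401, 610300070, 810103054]

# Forward mapping: student_id -> class index (same construction as in the notebook)
class_mapping = {sid: idx for idx, sid in enumerate(selected_students)}

# Inverse mapping: class index -> student_id
inverse_mapping = {idx: sid for sid, idx in class_mapping.items()}

predicted_classes = [2, 0, 1]
predicted_ids = [inverse_mapping[c] for c in predicted_classes]
print(predicted_ids)  # [810103054, 810101401, 610300070]
```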
Classification Process and Average Confusion Matrix¶
Calling the Classification Function¶
Result_DF = Run_Classification(balanced_data)
Model(KNN) using feature(time_domain_features) and feature reduction(LDA)
Model(KNN) using feature(time_domain_features) and feature reduction(PCA)
Model(KNN) using feature(time_domain_features) and feature reduction(None)
Model(KNN) using feature(MFCC) and feature reduction(LDA)
Model(KNN) using feature(MFCC) and feature reduction(PCA)
Model(KNN) using feature(MFCC) and feature reduction(None)
Model(KNN) using feature(Spectral Contrast) and feature reduction(LDA)
Model(KNN) using feature(Spectral Contrast) and feature reduction(PCA)
Model(KNN) using feature(Spectral Contrast) and feature reduction(None)
Model(KNN) using feature(MFCC&Sp_Contrast) and feature reduction(LDA)
Model(KNN) using feature(MFCC&Sp_Contrast) and feature reduction(PCA)
Model(KNN) using feature(MFCC&Sp_Contrast) and feature reduction(None)
Model(KNN) using feature(Spec_Cent&BW) and feature reduction(LDA)
Model(KNN) using feature(Spec_Cent&BW) and feature reduction(PCA)
Model(KNN) using feature(Spec_Cent&BW) and feature reduction(None)
Model(KNN) using feature(all_features) and feature reduction(LDA)
Model(KNN) using feature(all_features) and feature reduction(PCA)
Model(KNN) using feature(all_features) and feature reduction(None)
Model(Logistic Regression) using feature(time_domain_features) and feature reduction(LDA)
Model(Logistic Regression) using feature(time_domain_features) and feature reduction(PCA)
Model(Logistic Regression) using feature(time_domain_features) and feature reduction(None)
Model(Logistic Regression) using feature(MFCC) and feature reduction(LDA)
Model(Logistic Regression) using feature(MFCC) and feature reduction(PCA)
Model(Logistic Regression) using feature(MFCC) and feature reduction(None)
Model(Logistic Regression) using feature(Spectral Contrast) and feature reduction(LDA)
Model(Logistic Regression) using feature(Spectral Contrast) and feature reduction(PCA)
Model(Logistic Regression) using feature(Spectral Contrast) and feature reduction(None)
Model(Logistic Regression) using feature(MFCC&Sp_Contrast) and feature reduction(LDA)
Model(Logistic Regression) using feature(MFCC&Sp_Contrast) and feature reduction(PCA)
Model(Logistic Regression) using feature(MFCC&Sp_Contrast) and feature reduction(None)
Model(Logistic Regression) using feature(Spec_Cent&BW) and feature reduction(LDA)
Model(Logistic Regression) using feature(Spec_Cent&BW) and feature reduction(PCA)
Model(Logistic Regression) using feature(Spec_Cent&BW) and feature reduction(None)
Model(Logistic Regression) using feature(all_features) and feature reduction(LDA)
Model(Logistic Regression) using feature(all_features) and feature reduction(PCA)
Model(Logistic Regression) using feature(all_features) and feature reduction(None)
Model(SVM) using feature(time_domain_features) and feature reduction(LDA)
Model(SVM) using feature(time_domain_features) and feature reduction(PCA)
Model(SVM) using feature(time_domain_features) and feature reduction(None)
Model(SVM) using feature(MFCC) and feature reduction(LDA)
Model(SVM) using feature(MFCC) and feature reduction(PCA)
Model(SVM) using feature(MFCC) and feature reduction(None)
Model(SVM) using feature(Spectral Contrast) and feature reduction(LDA)
Model(SVM) using feature(Spectral Contrast) and feature reduction(PCA)
Model(SVM) using feature(Spectral Contrast) and feature reduction(None)
Model(SVM) using feature(MFCC&Sp_Contrast) and feature reduction(LDA)
Model(SVM) using feature(MFCC&Sp_Contrast) and feature reduction(PCA)
Model(SVM) using feature(MFCC&Sp_Contrast) and feature reduction(None)
Model(SVM) using feature(Spec_Cent&BW) and feature reduction(LDA)
Model(SVM) using feature(Spec_Cent&BW) and feature reduction(PCA)
Model(SVM) using feature(Spec_Cent&BW) and feature reduction(None)
Model(SVM) using feature(all_features) and feature reduction(LDA)
Model(SVM) using feature(all_features) and feature reduction(PCA)
Model(SVM) using feature(all_features) and feature reduction(None)
Sorting and Displaying Top Results¶
df_sorted = Result_DF.sort_values(by='Mean Accuracy', ascending=False)
df_sorted.head()
| | Model | Feature Set | Feature Reduction | Mean Accuracy (%) | Std Accuracy (%) |
|---|---|---|---|---|---|
| 9 | KNN | MFCC&Sp_Contrast | LDA | 99.880383 | 0.207183 |
| 51 | SVM | all_features | LDA | 99.880383 | 0.207183 |
| 33 | Logistic Regression | all_features | LDA | 99.880383 | 0.207183 |
| 34 | Logistic Regression | all_features | PCA | 99.880383 | 0.207183 |
| 35 | Logistic Regression | all_features | None | 99.880383 | 0.207183 |
Plotting Results¶
Plot_Results_Barchart(Result_DF)
Maximum accuracy if feature reduction(LDA): 99.88%
Maximum accuracy if feature reduction(PCA): 99.88%
Maximum accuracy if feature reduction(None): 99.88%
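`Run_Classification` is defined earlier in the notebook; for reference, its evaluation loop plausibly resembles the following sketch (a scikit-learn pipeline with optional LDA/PCA and 5-fold cross-validation on synthetic data; the exact fold count, classifier parameters, and feature-set slicing are assumptions, not the notebook's definition):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for one feature set with 6 student classes
X, y = make_classification(n_samples=300, n_features=20, n_classes=6,
                           n_informative=10, random_state=42)

for name, reducer in [('LDA', LinearDiscriminantAnalysis(n_components=5)),
                      ('PCA', PCA(n_components=0.95)),
                      ('None', 'passthrough')]:
    # Scale, optionally reduce dimensionality, then classify
    pipe = Pipeline([('scale', StandardScaler()),
                     ('reduce', reducer),
                     ('clf', SVC())])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"Model(SVM) reduction({name}): "
          f"mean={scores.mean()*100:.2f}%, std={scores.std()*100:.2f}%")
```

Putting the scaler and reducer inside the `Pipeline` ensures they are fit only on each training fold, so the reported cross-validation accuracies are not inflated by information leaking from the held-out fold.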